The TSstudio provides a set of tools for time series analysis and forecasting applications, using (mainly) the forecast package models and the plotly package data visualization engine. The following document demonstrates the usage of the package to forecast the monthly consumption of natural gas in the US in the next 5 years (or 60 months)
Install from CRAN:
install.packages("TSstudio")
Or from Github:
devtools::install_github("RamiKrispin/TSstudio")
As for December 2018, the most updated version is 0.1.3
library(TSstudio)
The USgas dataset, one of the package datasets, represents the monthly consumption (in Billion Cubic Feet) of natural gas in the US since January 2000 [1]:
# Load the series
data("USgas")
# Get the series info
ts_info(USgas)
## The USgas series is a ts object with 1 variable and 227 observations
## Frequency: 12
## Start time: 2000 1
## End time: 2018 11
# Plot the series
ts_plot(USgas,
title = "US Monthly Natural Gas Consumption",
Ytitle = "Billion Cubic Feet",
Xtitle = "Source: Federal Reserve Bank of St. Louis",
slider = TRUE)
The goal of the exploratory analysis is to identify the key characteristics of the series with the use of descriptive analysis methods. Throughout this process we are mainly want to detect:
Those insights provide us with a better understanding of the past and can be utilized to forecast the future.
We will start with decomposing the series into its three components - the trend, seasonal and random components. The ts_decompose function provides an interactive inference for the decompose function form the stats package:
ts_decompose(USgas)
We can observe that the trend of the series is fairly flat up to 2010 and afterward start to increase. Also, it seems from the trend plot that it is not linear.
You can note from both the series and decompose plots that the series has a strong seasonal pattern along with a non-linear trend. We will use the ts_seasonal, ts_heatmap and ts_surface functions to explore the seasonal pattern of the series:
ts_seasonal(USgas, type = "all")
The ts_seasonal function provides three different views of the seasonality of the series (when the type argument is equal to all):
Split and plot the series by the full frequency cycle of the series, which in the case of monthly series is a full year. This view allows you to observe and compare the variation of each frequency unit from year to year. The plot’s color scale set by the chronological order of the lines (i.e., from dark color for early years and bright colors for the latest years).
Plot each frequency unit over time, and in the case of monthly series, each line represents a month of consumption over time. This allows us to observe if the seasonal pattern remains the same over time.
Last but not least, is box-plot representative of each frequency unit, which allows us to compare the distribution of each frequency unit.
The main observations from this set of plots are:
To get a more clear view of the seasonal pattern of the series, you may want to remove the series growth (or detrend) and replot it:
ts_seasonal(USgas - decompose(USgas)$trend, type = "all")
Alternatively, you can use ts_heatmap or the ts_surface function to view and explore the seasonal pattern of the series:
ts_heatmap(USgas)
ts_surface(USgas)
The next step is to identify the level of correlation between the series and it’s lags, using the ts_acf function (an interactive interface for the acf function):
ts_acf(USgas, lag.max = 36)
As expected you can notice that the series has a high correlation with its seasonal lags. A more intuitive way to review identify the relationship between the series and its lags is with the ts_lags function, which provides plots of the series against its lags:
ts_lags(USgas)
As observed before with the ts_acf function, you can notice that the seasonal lag (or lag 12) of the series has a strong linear relationship with the series. Similarly, we can zoom in on the seasonal lags of the series using the lags argument:
ts_lags(USgas, lags = c(12, 24, 36, 48))
Here are the key insights we learned from the exploratory analysis process:
Using the information we learned from the seasonal and correlation analysis, we will conduct “horse race” between several models and select the one that performe best on the testing set, by using two training approaches:
Traditional approach - by splitting the series into training and testing (sample out) partitions. Train each model on the training set and evluate its perfromance on the testing set. We will use the following four models - auto.arima, ets, nnetar and tslm from the forecast package
Backtesting - by using expending window to train and test each model on multiple training and testing sets. We will utilize the ts_backtesting function to train multiple models from the forecast, forecastHybrid, and bsts packages.
# Set the sample out and forecast horizon
h1 <- 12 # sample out lenght
h2 <- 60 # forecast horizon
USgas_split <- ts_split(USgas, sample.out = h1)
train <- USgas_split$train
test <- USgas_split$test
ts_info(train)
## The train series is a ts object with 1 variable and 215 observations
## Frequency: 12
## Start time: 2000 1
## End time: 2017 11
ts_info(test)
## The test series is a ts object with 1 variable and 12 observations
## Frequency: 12
## Start time: 2017 12
## End time: 2018 11
set.seed(1234)
library(forecast)
# auto.arima
md1 <- auto.arima(train,
stepwise = FALSE,
approximation = FALSE,
D = 1)
fc1 <- forecast(md1, h = h1)
accuracy(fc1, test)
## ME RMSE MAE MPE MAPE MASE
## Training set -1.586018 98.37274 72.07709 -0.2748132 3.497391 0.6718145
## Test set 198.515290 216.81325 200.60105 7.9783005 8.055568 1.8697577
## ACF1 Theil's U
## Training set 0.02180888 NA
## Test set -0.02214037 0.815383
# ETS
md2 <- ets(train, opt.crit = "mse")
fc2 <- forecast(md2, h = h1)
accuracy(fc2, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 4.530822 99.61901 73.11756 0.07081195 3.561811 0.6815125
## Test set 114.734134 173.88352 155.67014 4.94191827 6.458404 1.4509667
## ACF1 Theil's U
## Training set 0.06787102 NA
## Test set 0.41614139 0.6901814
# Neural network time series forecasts
md3 <- nnetar(train,
P = 2, # using 2 seasonal lags
p = 1, # and 1 non-seasonal lags
repeats = 100)
fc3 <- forecast(md3, h = h1)
accuracy(fc3, test)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01423023 119.9402 94.28152 -0.3203614 4.587159 0.8787771
## Test set 212.30940334 237.3941 212.30940 8.1450374 8.145037 1.9788886
## ACF1 Theil's U
## Training set 0.3865153 NA
## Test set 0.4343893 0.8083805
# Time series linear regression
md4 <- tslm(train ~ season + trend + I(trend ^ 2))
fc4 <- forecast(md4, h = h1)
accuracy(fc4, test)
## ME RMSE MAE MPE
## Training set -0.00000000000001585288 105.1812 76.25964 -0.2349141
## Test set 88.96024173258881262427 134.7567 118.67616 3.4109552
## MAPE MASE ACF1 Theil's U
## Training set 3.706897 0.7107991 0.47320514 NA
## Test set 4.559134 1.1061541 -0.03942927 0.4659541
We will utlize the test_forecast function to visualize the goodness of fit of each model on both the training and testing partitions:
test_forecast(forecast.obj = fc1, actual = USgas, test = test)
test_forecast(forecast.obj = fc2, actual = USgas, test = test)
test_forecast(forecast.obj = fc3, actual = USgas, test = test)
test_forecast(forecast.obj = fc4, actual = USgas, test = test)